Visualization of Public Trees in Vancouver#
The Vancouver trees dataset contains a listing of public trees on boulevards in the City of Vancouver and provides data on tree coordinates, species and other related characteristics.
For more information, see: https://opendata.vancouver.ca/explore/dataset/street-trees/information/?disjunctive.species_name&disjunctive.common_name&disjunctive.height_range_id&disjunctive.on_street&disjunctive.neighbourhood_name.
In this example, I investigate the top 10 tree species present in the dataset, looking at their prevalence within the city (which neighbourhoods they can be found in) and at how their planting (i.e. how many trees of each species are planted each year) has changed over time. In addition, I look at how tree properties (diameter and height) vary between species and neighbourhoods.
I use a combination of the following plots:
heat map
bar chart
line chart
geographic map
scatter plot
Description and Review of Data#
# Import libraries needed for this assignment
import altair as alt
import pandas as pd
# Read in the file. Let's immediately parse the "date_planted" column into DateTime dtype.
trees_df = pd.read_csv('https://raw.githubusercontent.com/UBC-MDS/data_viz_wrangled/main/data/Trees_data_sets/small_unique_vancouver.csv', parse_dates=["date_planted"])
trees_df.head(10)
| | Unnamed: 0 | std_street | on_street | species_name | neighbourhood_name | date_planted | diameter | street_side_name | genus_name | assigned | ... | plant_area | curb | tree_id | common_name | height_range_id | on_street_block | cultivar_name | root_barrier | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10747 | W 20TH AV | W 20TH AV | PLATANOIDES | Riley Park | 2000-02-23 | 28.5 | EVEN | ACER | N | ... | 15 | Y | 21421 | NORWAY MAPLE | 4 | 0 | NaN | N | 49.252711 | -123.106323 |
| 1 | 12573 | W 18TH AV | W 18TH AV | CALLERYANA | Arbutus-Ridge | 1992-02-04 | 6.0 | ODD | PYRUS | N | ... | 7 | Y | 129645 | CHANTICLEER PEAR | 2 | 2300 | CHANTICLEER | N | 49.256350 | -123.158709 |
| 2 | 29676 | ROSS ST | ROSS ST | NIGRA | Sunset | NaT | 12.0 | ODD | PINUS | N | ... | 7 | Y | 154675 | AUSTRIAN PINE | 4 | 7800 | NaN | N | 49.213486 | -123.083254 |
| 3 | 8856 | DOMAN ST | DOMAN ST | AMERICANA | Killarney | 1999-11-12 | 11.0 | EVEN | FRAXINUS | N | ... | 7 | Y | 180803 | AUTUMN APPLAUSE ASH | 4 | 6900 | AUTUMN APPLAUSE | N | 49.220839 | -123.036721 |
| 4 | 21098 | EAST BOULEVARD | EAST BOULEVARD | HIPPOCASTANUM | Shaughnessy | NaT | 15.5 | ODD | AESCULUS | Y | ... | N | Y | 74364 | COMMON HORSECHESTNUT | 4 | 5200 | NaN | N | 49.238514 | -123.154958 |
| 5 | 17458 | BUTE ST | BUTE ST | PERSICA | West End | 2012-04-05 | 3.0 | EVEN | PARROTIA | N | ... | C | Y | 233622 | VANESSA PERSIAN IRONWOOD | 1 | 1100 | VANESSA | N | 49.281906 | -123.133076 |
| 6 | 1476 | PRESTWICK DRIVE | NASSAU DRIVE | CAMPESTRE | Victoria-Fraserview | NaT | 12.0 | ODD | ACER | N | ... | 15 | Y | 105171 | HEDGE MAPLE | 3 | 1700 | NaN | N | 49.217522 | -123.071311 |
| 7 | 5120 | FLEMING ST | FLEMING ST | OFFICINALIS | Kensington-Cedar Cottage | 2001-04-02 | 3.0 | EVEN | MAGNOLIA | N | ... | N | Y | 187792 | CHINESE MAGNOLIA | 2 | 3700 | NaN | N | 49.251127 | -123.071912 |
| 8 | 18338 | W PENDER ST | W PENDER ST | PALUSTRIS | Downtown | 1999-12-17 | 8.0 | ODD | QUERCUS | N | ... | C | Y | 104016 | PIN OAK | 1 | 100 | NaN | N | 49.281303 | -123.108253 |
| 9 | 28279 | MATAPAN CRESCENT | MATAPAN CRESCENT | ZUMI | Renfrew-Collingwood | 2008-03-13 | 3.0 | ODD | MALUS | N | ... | 12 | Y | 102612 | REDBUD CRABAPPLE | 1 | 3200 | CALOCARPA | Y | 49.257272 | -123.030023 |
10 rows × 21 columns
trees_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 5000 non-null int64
1 std_street 5000 non-null object
2 on_street 5000 non-null object
3 species_name 5000 non-null object
4 neighbourhood_name 5000 non-null object
5 date_planted 2363 non-null datetime64[ns]
6 diameter 5000 non-null float64
7 street_side_name 5000 non-null object
8 genus_name 5000 non-null object
9 assigned 5000 non-null object
10 civic_number 5000 non-null int64
11 plant_area 4950 non-null object
12 curb 5000 non-null object
13 tree_id 5000 non-null int64
14 common_name 5000 non-null object
15 height_range_id 5000 non-null int64
16 on_street_block 5000 non-null int64
17 cultivar_name 2658 non-null object
18 root_barrier 5000 non-null object
19 latitude 5000 non-null float64
20 longitude 5000 non-null float64
dtypes: datetime64[ns](1), float64(3), int64(5), object(12)
memory usage: 820.4+ KB
There are 5000 entries in the data frame, with columns of dtype int64, object and float64 (plus date_planted, which I parsed as datetime64 when reading the file). The columns “date_planted”, “plant_area”, and “cultivar_name” contain null or NaN values. “date_planted” and “cultivar_name” in particular are each missing roughly half of their values; it may therefore be better to drop these columns, but that, of course, depends on the questions of interest and what we want to explore in our data analysis. Given that I want to investigate how the number of trees of each species planted each year has changed over time, I will NOT drop the date_planted column.
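From the `info()` output above, 2,637 of the 5,000 rows (about 52.7%) have no date_planted, 2,342 (about 46.8%) have no cultivar_name, and 50 (1%) have no plant_area. A minimal sketch of how such a summary could be computed directly is shown below; the small `demo_df` frame and the `missing_summary` name are stand-ins introduced here for illustration, not part of the original notebook.

```python
import pandas as pd

# Tiny stand-in frame; the values are illustrative, not the full dataset.
demo_df = pd.DataFrame({
    "species_name": ["SERRULATA", "NIGRA", "PLATANOIDES", "ZUMI"],
    "date_planted": pd.to_datetime(["2008-03-13", None, "2000-02-23", None]),
    "cultivar_name": ["SHIROTAE", None, None, "CALOCARPA"],
})

# Count and percentage of missing values per column.
missing_summary = pd.DataFrame({
    "n_missing": demo_df.isna().sum(),
    "pct_missing": (demo_df.isna().mean() * 100).round(1),
})
print(missing_summary)
```

Running the same two expressions on `trees_df` should reproduce the counts read off the `info()` output.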
# Let's see some summary statistics
trees_df.describe()
| | Unnamed: 0 | date_planted | diameter | civic_number | tree_id | height_range_id | on_street_block | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|
| count | 5000.000000 | 2363 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.00000 | 5000.000000 | 5000.000000 | 5000.000000 |
| mean | 14861.920400 | 2003-09-06 04:03:08.912399488 | 12.340888 | 2975.707600 | 128682.584600 | 2.73440 | 2960.227000 | 49.247349 | -123.107128 |
| min | 2.000000 | 1989-10-31 00:00:00 | 0.000000 | 2.000000 | 36.000000 | 0.00000 | 0.000000 | 49.202783 | -123.220560 |
| 25% | 7192.750000 | 1997-11-06 00:00:00 | 4.000000 | 1300.500000 | 61321.500000 | 2.00000 | 1300.000000 | 49.230152 | -123.144178 |
| 50% | 14870.000000 | 2003-02-12 00:00:00 | 10.000000 | 2639.000000 | 130130.500000 | 2.00000 | 2600.000000 | 49.247981 | -123.105861 |
| 75% | 22366.750000 | 2009-11-17 00:00:00 | 18.000000 | 4123.000000 | 191332.000000 | 4.00000 | 4100.000000 | 49.263275 | -123.063484 |
| max | 29992.000000 | 2019-05-07 00:00:00 | 71.000000 | 9113.000000 | 270750.000000 | 9.00000 | 9100.000000 | 49.293930 | -123.023311 |
| std | 8680.023278 | NaN | 9.266600 | 2078.580429 | 75412.260406 | 1.56957 | 2086.861052 | 0.021251 | 0.049137 |
# Finally, let's use value_counts() to see how many different "species_names" and "common_names" there are, and just to see what types of strings these columns contain.
top_trees_species_names = trees_df["species_name"].value_counts()
top_trees_species_names
species_name
SERRULATA 463
PLATANOIDES 444
CERASIFERA 396
RUBRUM 261
AMERICANA 182
...
GRANDIFLORA 1
LAEVIS 1
LOEBNERI X 1
SERRULA 1
LUTEA 1
Name: count, Length: 171, dtype: int64
top_trees_common_names = trees_df["common_name"].value_counts()
top_trees_common_names
common_name
KWANZAN FLOWERING CHERRY 383
PISSARD PLUM 295
NORWAY MAPLE 215
CRIMEAN LINDEN 152
PYRAMIDAL EUROPEAN HORNBEAM 100
...
CHINESE WINGNUT 1
ELM SPECIES 1
UMBRELLA CATALPA 1
MAGNOLIA 'MERRILL' 1
SWEETGUM SPECIES 1
Name: count, Length: 361, dtype: int64
Questions of Interest#
I want to answer the following questions in my analysis:
What is the prevalence of the top 10 tree species within the city (which neighbourhoods can they be found in)?
How has the distribution of these trees (i.e. how many are being planted each year, of each species) changed over time?
Can we visualize the total tree counts per neighbourhood on a map?
In addition, I want to explore how tree properties (diameter and height) vary between the species and neighbourhoods.
Question 1. What is the prevalence of the top 10 tree species within the city (which neighbourhoods can they be found in)?#
As seen earlier in this assignment (and below), the following are the top ten species: SERRULATA, PLATANOIDES, CERASIFERA, RUBRUM, AMERICANA, SYLVATICA, BETULUS, EUCHLORA X, FREEMANI X, and CAMPESTRE.
Let’s filter our dataframe to only look at these species.
top_trees_species_names.nlargest(10)
species_name
SERRULATA 463
PLATANOIDES 444
CERASIFERA 396
RUBRUM 261
AMERICANA 182
SYLVATICA 178
BETULUS 170
EUCHLORA X 152
FREEMANI X 127
CAMPESTRE 124
Name: count, dtype: int64
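Just out of curiosity, I looked up these trees online. Serrulata is the “Japanese cherry”, Platanoides the “Norway maple”, Cerasifera the “Cherry plum”, Rubrum the “Red maple”, Americana the “Linden tree”, Sylvatica the “Sour gum”, Betulus the “European hornbeam”, Euchlora the “Caucasian linden”, Freemani the “Freeman maple”, and Campestre the “Field maple.” These are all deciduous trees. One caveat: species_name holds only the species epithet, so a single name can cover more than one genus. For example, the AMERICANA tree in row 3 of the first table belongs to the genus FRAXINUS (an ash), not a linden.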
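Because species_name is not guaranteed to be unique across genera, it can be worth checking how many genera sit behind each name before treating it as a single kind of tree. The sketch below does this on a small stand-in frame: the AMERICANA/FRAXINUS and SERRULATA/PRUNUS pairings appear in the tables above, while the TILIA row is a hypothetical added only to make the check visible.

```python
import pandas as pd

# Stand-in rows; the TILIA entry is hypothetical, for illustration only.
demo_df = pd.DataFrame({
    "species_name": ["AMERICANA", "AMERICANA", "SERRULATA", "SERRULATA"],
    "genus_name": ["FRAXINUS", "TILIA", "PRUNUS", "PRUNUS"],
})

# Number of distinct genera recorded under each species name.
genera_per_species = demo_df.groupby("species_name")["genus_name"].nunique()

# Species names that span more than one genus.
ambiguous = genera_per_species[genera_per_species > 1]
print(ambiguous)
```

On the real data, the same `groupby(...).nunique()` call on `trees_df` would show which of the top 10 names, if any, mix several genera.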
top10_trees = ["SERRULATA", "PLATANOIDES", "CERASIFERA", "RUBRUM", "AMERICANA", "SYLVATICA", "BETULUS", "EUCHLORA X", "FREEMANI X", "CAMPESTRE"]
# Creating a new dataframe to populate with the top 10 species data
trees_df_top10 = pd.DataFrame(columns=trees_df.columns)
# Let's use a for-loop to filter our trees_df dataframe, and add the top 10 species to our new trees_df_top10 dataframe.
for tree in top10_trees:
    trees_toadd = trees_df[trees_df["species_name"].str.contains(tree)]
    trees_df_top10 = pd.concat([trees_df_top10, trees_toadd])
trees_df_top10 = trees_df_top10.reset_index()
trees_df_top10.head()
C:\Users\celle\AppData\Local\Temp\ipykernel_8360\32736908.py:7: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
trees_df_top10 = pd.concat([trees_df_top10, trees_toadd])
| | index | Unnamed: 0 | std_street | on_street | species_name | neighbourhood_name | date_planted | diameter | street_side_name | genus_name | ... | plant_area | curb | tree_id | common_name | height_range_id | on_street_block | cultivar_name | root_barrier | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | 17945 | W 12TH AV | W 12TH AV | SERRULATA | Kitsilano | 2008-03-13 | 9.0 | ODD | PRUNUS | ... | 20 | Y | 106587 | SHIROTAE(MT FUJI) CHERRY | 1 | 2600 | SHIROTAE | N | 49.261319 | -123.164948 |
| 1 | 21 | 28441 | ST. CATHERINES ST | E 49TH AV | SERRULATA | Sunset | NaT | 14.0 | ODD | PRUNUS | ... | 4 | Y | 44256 | KWANZAN FLOWERING CHERRY | 3 | 800 | KWANZAN | N | 49.225494 | -123.087200 |
| 2 | 42 | 24476 | W 35TH AV | W 35TH AV | SERRULATA | Shaughnessy | NaT | 11.0 | EVEN | PRUNUS | ... | 12 | Y | 33656 | KWANZAN FLOWERING CHERRY | 2 | 2000 | KWANZAN | N | 49.239992 | -123.152677 |
| 3 | 44 | 16997 | VENABLES ST | VERNON DRIVE | SERRULATA | Strathcona | NaT | 22.0 | ODD | PRUNUS | ... | 7 | Y | 115638 | UKON JAPANESE CHERRY | 3 | 800 | UKON | N | 49.277064 | -123.079379 |
| 4 | 60 | 1292 | CAMOSUN ST | CAMOSUN ST | SERRULATA | Dunbar-Southlands | NaT | 16.0 | ODD | PRUNUS | ... | N | Y | 204485 | KWANZAN FLOWERING CHERRY | 2 | 4400 | KWANZAN | N | 49.246430 | -123.196900 |
5 rows × 22 columns
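The loop above starts from an empty data frame, which is what triggers the FutureWarning and is the likely reason the integer columns appear as object dtype in the later `info()` output. A possible alternative, sketched here on a small stand-in frame, is a single boolean mask with `isin`; the `demo_df`, `top_species`, and `top_df` names and the tree_id values are illustrative assumptions.

```python
import pandas as pd

# Stand-in data; tree_id values are illustrative.
demo_df = pd.DataFrame({
    "species_name": ["SERRULATA", "NIGRA", "RUBRUM", "ZUMI", "SERRULATA"],
    "tree_id": [106587, 154675, 1, 102612, 44256],
})
top_species = ["SERRULATA", "RUBRUM"]

# Keep only rows whose species is in the list, using exact matching.
top_df = demo_df[demo_df["species_name"].isin(top_species)].reset_index(drop=True)
print(top_df)
print(top_df.dtypes)
```

Two differences from the loop are worth noting: `isin` matches whole names rather than substrings, and the rows keep their original order instead of being grouped species by species.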
# Let's plot a heat map to see which trees are present in each neighbourhood, and how many.
# I've added a tooltip to help see how many trees exactly are denoted in the heat map. I've also added a select tool, to enable the selection of one of the neighbourhoods.
select_neighbourhood_click = alt.selection_point(encodings=["y"], on='click', nearest=True)
tree_plot = alt.Chart(trees_df_top10).mark_rect().encode(
    alt.X('species_name', title="Species name"),
    alt.Y('neighbourhood_name', title="Neighbourhood name"),
    color='count()',
    tooltip=[alt.Tooltip("count()", title="Number of trees")],
    opacity=alt.condition(select_neighbourhood_click, alt.value(0.9), alt.value(0.2))
).properties(title="Count of trees within neighbourhoods")
tree_plot.add_params(select_neighbourhood_click)
In the EDA, I initially used a plain mark_rect plot to visualize this data. I quickly realized that a heat map (the same marks coloured by count) would be better, because it shows not only whether a species is present in a neighbourhood, but also how many trees of that species are present.
Although the above plot demonstrates that there are certain neighbourhoods with greater tree counts than others, it also shows that almost all of the neighbourhoods have at least one exemplar of each of the top 10 tree species. It seems as though these trees are pretty well distributed throughout the city!
Question 2. How has the distribution of these trees (i.e. how many are being planted each year, of each species) changed over time?#
Has this been different over the different neighbourhoods?#
# First, let's filter out the trees that do not have a "date_planted" value
trees_filtered_df = trees_df_top10[~pd.isnull(trees_df_top10["date_planted"])].reset_index()
trees_filtered_df.head()
| | level_0 | index | Unnamed: 0 | std_street | on_street | species_name | neighbourhood_name | date_planted | diameter | street_side_name | ... | plant_area | curb | tree_id | common_name | height_range_id | on_street_block | cultivar_name | root_barrier | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 19 | 17945 | W 12TH AV | W 12TH AV | SERRULATA | Kitsilano | 2008-03-13 | 9.00 | ODD | ... | 20 | Y | 106587 | SHIROTAE(MT FUJI) CHERRY | 1 | 2600 | SHIROTAE | N | 49.261319 | -123.164948 |
| 1 | 8 | 114 | 1978 | SLOCAN ST | SLOCAN ST | SERRULATA | Renfrew-Collingwood | 2011-01-18 | 3.25 | EVEN | ... | B | Y | 21236 | KWANZAN FLOWERING CHERRY | 1 | 3400 | KWANZAN | N | 49.253228 | -123.049443 |
| 2 | 16 | 253 | 10562 | CHALDECOTT ST | CHALDECOTT ST | SERRULATA | Dunbar-Southlands | 2009-04-24 | 12.00 | EVEN | ... | N | Y | 15443 | KWANZAN FLOWERING CHERRY | 2 | 4400 | KWANZAN | N | 49.247000 | -123.192180 |
| 3 | 18 | 263 | 15849 | W 30TH AV | W 30TH AV | SERRULATA | Arbutus-Ridge | 1989-11-08 | 24.00 | ODD | ... | 7 | Y | 123108 | KWANZAN FLOWERING CHERRY | 4 | 2700 | KWANZAN | N | 49.245210 | -123.167140 |
| 4 | 22 | 300 | 183 | W 40TH AV | W 40TH AV | SERRULATA | Shaughnessy | 1996-05-31 | 13.50 | ODD | ... | 10 | Y | 168916 | KWANZAN FLOWERING CHERRY | 2 | 1600 | KWANZAN | N | 49.235750 | -123.144273 |
5 rows × 23 columns
Our initial trees_df_top10 contained 2497 trees. Now we have only 1053 trees in our dataframe.
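In other words, 1,444 of the 2,497 top-10 trees (about 58%) have no planting date and drop out at this step. A minimal sketch of the same filter written with `dropna` is given below; using `reset_index(drop=True)` also avoids carrying the old index along as an extra `level_0` column. The `demo_df`, `dated_df`, and `n_dropped` names are introduced here for illustration only.

```python
import pandas as pd

# Stand-in frame with one missing planting date.
demo_df = pd.DataFrame({
    "species_name": ["SERRULATA", "SERRULATA", "RUBRUM"],
    "date_planted": pd.to_datetime(["2008-03-13", None, "1996-05-31"]),
})

# Drop rows without a planting date and renumber without keeping the old index.
dated_df = demo_df.dropna(subset=["date_planted"]).reset_index(drop=True)

# Report how many rows were lost to the filter.
n_dropped = len(demo_df) - len(dated_df)
print(f"Dropped {n_dropped} of {len(demo_df)} rows")
```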
# Let's add a column to our trees_filtered_df to extract the year a tree was planted from the "date_planted" column.
trees_filtered_df = trees_filtered_df.assign(year_planted = trees_filtered_df.date_planted.dt.year)
trees_filtered_df.head()
| | level_0 | index | Unnamed: 0 | std_street | on_street | species_name | neighbourhood_name | date_planted | diameter | street_side_name | ... | curb | tree_id | common_name | height_range_id | on_street_block | cultivar_name | root_barrier | latitude | longitude | year_planted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 19 | 17945 | W 12TH AV | W 12TH AV | SERRULATA | Kitsilano | 2008-03-13 | 9.00 | ODD | ... | Y | 106587 | SHIROTAE(MT FUJI) CHERRY | 1 | 2600 | SHIROTAE | N | 49.261319 | -123.164948 | 2008 |
| 1 | 8 | 114 | 1978 | SLOCAN ST | SLOCAN ST | SERRULATA | Renfrew-Collingwood | 2011-01-18 | 3.25 | EVEN | ... | Y | 21236 | KWANZAN FLOWERING CHERRY | 1 | 3400 | KWANZAN | N | 49.253228 | -123.049443 | 2011 |
| 2 | 16 | 253 | 10562 | CHALDECOTT ST | CHALDECOTT ST | SERRULATA | Dunbar-Southlands | 2009-04-24 | 12.00 | EVEN | ... | Y | 15443 | KWANZAN FLOWERING CHERRY | 2 | 4400 | KWANZAN | N | 49.247000 | -123.192180 | 2009 |
| 3 | 18 | 263 | 15849 | W 30TH AV | W 30TH AV | SERRULATA | Arbutus-Ridge | 1989-11-08 | 24.00 | ODD | ... | Y | 123108 | KWANZAN FLOWERING CHERRY | 4 | 2700 | KWANZAN | N | 49.245210 | -123.167140 | 1989 |
| 4 | 22 | 300 | 183 | W 40TH AV | W 40TH AV | SERRULATA | Shaughnessy | 1996-05-31 | 13.50 | ODD | ... | Y | 168916 | KWANZAN FLOWERING CHERRY | 2 | 1600 | KWANZAN | N | 49.235750 | -123.144273 | 1996 |
5 rows × 24 columns
# Let's take our trees_filtered dataframe and group by species_name and year.
trees_by_species_and_year = trees_filtered_df.groupby(["species_name", trees_filtered_df.date_planted.dt.year]).size().reset_index().rename(columns = {0: "tree_count"})
trees_by_species_and_year
| | species_name | date_planted | tree_count |
|---|---|---|---|
| 0 | AMERICANA | 1992 | 1 |
| 1 | AMERICANA | 1993 | 5 |
| 2 | AMERICANA | 1994 | 6 |
| 3 | AMERICANA | 1995 | 1 |
| 4 | AMERICANA | 1996 | 3 |
| ... | ... | ... | ... |
| 231 | SYLVATICA | 2014 | 7 |
| 232 | SYLVATICA | 2015 | 1 |
| 233 | SYLVATICA | 2017 | 1 |
| 234 | SYLVATICA | 2018 | 4 |
| 235 | SYLVATICA | 2019 | 2 |
236 rows × 3 columns
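Note that the grouped column above is still called date_planted even though it now holds years. One way to get clearer column names, sketched here on stand-in data, is to group on an explicit year column and name the count directly in `reset_index`; the dates and the `counts` name are illustrative assumptions.

```python
import pandas as pd

# Stand-in planting records; the dates are illustrative.
demo_df = pd.DataFrame({
    "species_name": ["AMERICANA", "AMERICANA", "AMERICANA", "SYLVATICA"],
    "date_planted": pd.to_datetime(
        ["1993-04-01", "1993-11-15", "1994-02-10", "2014-03-03"]
    ),
})

# Count trees per species and planting year, with explicit column names.
counts = (
    demo_df
    .assign(year_planted=demo_df["date_planted"].dt.year)
    .groupby(["species_name", "year_planted"])
    .size()
    .reset_index(name="tree_count")
)
print(counts)
```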
# When we check the dataframe info, we can see that the new year_planted column has an integer dtype (int32) rather than datetime.
trees_filtered_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1053 entries, 0 to 1052
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 level_0 1053 non-null int64
1 index 1053 non-null int64
2 Unnamed: 0 1053 non-null object
3 std_street 1053 non-null object
4 on_street 1053 non-null object
5 species_name 1053 non-null object
6 neighbourhood_name 1053 non-null object
7 date_planted 1053 non-null datetime64[ns]
8 diameter 1053 non-null float64
9 street_side_name 1053 non-null object
10 genus_name 1053 non-null object
11 assigned 1053 non-null object
12 civic_number 1053 non-null object
13 plant_area 1044 non-null object
14 curb 1053 non-null object
15 tree_id 1053 non-null object
16 common_name 1053 non-null object
17 height_range_id 1053 non-null object
18 on_street_block 1053 non-null object
19 cultivar_name 904 non-null object
20 root_barrier 1053 non-null object
21 latitude 1053 non-null float64
22 longitude 1053 non-null float64
23 year_planted 1053 non-null int32
dtypes: datetime64[ns](1), float64(3), int32(1), int64(2), object(17)
memory usage: 193.5+ KB
# Let's change it back to datetime, so that we don't have trouble plotting.
trees_filtered_df['year_planted'] = pd.to_datetime(trees_filtered_df['year_planted'], format='%Y')
trees_filtered_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1053 entries, 0 to 1052
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 level_0 1053 non-null int64
1 index 1053 non-null int64
2 Unnamed: 0 1053 non-null object
3 std_street 1053 non-null object
4 on_street 1053 non-null object
5 species_name 1053 non-null object
6 neighbourhood_name 1053 non-null object
7 date_planted 1053 non-null datetime64[ns]
8 diameter 1053 non-null float64
9 street_side_name 1053 non-null object
10 genus_name 1053 non-null object
11 assigned 1053 non-null object
12 civic_number 1053 non-null object
13 plant_area 1044 non-null object
14 curb 1053 non-null object
15 tree_id 1053 non-null object
16 common_name 1053 non-null object
17 height_range_id 1053 non-null object
18 on_street_block 1053 non-null object
19 cultivar_name 904 non-null object
20 root_barrier 1053 non-null object
21 latitude 1053 non-null float64
22 longitude 1053 non-null float64
23 year_planted 1053 non-null datetime64[ns]
dtypes: datetime64[ns](2), float64(3), int64(2), object(17)
memory usage: 197.6+ KB
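The conversion above turns each integer year into a timestamp for 1 January of that year, which is what lets Altair treat the axis as temporal. A small sketch of the effect follows; casting to string first is a cautious choice made here for the example, and the `years` and `as_dates` names are illustrative.

```python
import pandas as pd

# Integer years, as produced by .dt.year.
years = pd.Series([2008, 2011, 1989], name="year_planted")

# Parse each year as a date; every value lands on 1 January of that year.
as_dates = pd.to_datetime(years.astype(str), format="%Y")
print(as_dates)
```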
# Now let's re-create our tree_plot on this trees_filtered_df, since we decreased the amount of our data by over half.
# Also, we want to use this plot later in a dashboard with other plots made using this reduced/filtered dataframe.
tree_plot_filtered = alt.Chart(trees_filtered_df).mark_rect().encode(
    alt.X('species_name', title="Species name"),
    alt.Y('neighbourhood_name', title="Neighbourhood name"),
    color=alt.Color('count()'),
    tooltip=[alt.Tooltip("count()", title="Number of trees")],
    opacity=alt.condition(select_neighbourhood_click, alt.value(0.9), alt.value(0.2))
).properties(title="Count of trees within neighbourhoods")
# Add a title with instructions for how to use the interactivity.
tree_plot_title = alt.TitleParams("Count of trees within neighbourhoods",
subtitle = "Click within the chart to select a neighbourhood to highlight.",
anchor = 'middle',
fontSize = 14,
subtitleFontSize = 12)
tree_plot_filtered = tree_plot_filtered.add_params(select_neighbourhood_click)
tree_plot_filtered = tree_plot_filtered.properties(title=tree_plot_title)
tree_plot_filtered
# Let's use a stacked bar chart to see how the distribution of different species being planted each year has changed over time.
# This type of chart enables one to see at the same time the TOTAL number of trees planted in a year, and (via coloured bars), how many trees of each species make up this total.
# I've added interactivity by enabling clicking on the legend to focus on a particular species (one or multiple).
legend_select = alt.selection_point(fields=['species_name'], bind='legend')
total_tree_bar_plot_int = alt.Chart(trees_filtered_df).mark_bar().encode(
x=alt.X('year_planted', title="Year"),
y=alt.Y('count()', title = "Trees planted"),
color=alt.Color('species_name', scale=alt.Scale(domain=top10_trees), title="Species name"),
opacity=alt.condition(legend_select, alt.value(0.9), alt.value(0.2))).properties(title="Total trees planted per year")
total_tree_bar_plot_int = total_tree_bar_plot_int.transform_filter(select_neighbourhood_click).transform_filter(legend_select).add_params(select_neighbourhood_click, legend_select)
total_tree_bar_plot_int
# We can also make a line chart of ALL of the trees planted per year.
trees_by_year_plot = alt.Chart(trees_filtered_df).mark_line().encode(alt.X('year_planted', title=None), alt.Y('count()', title = "Trees planted"))
trees_by_year_plot
It looks like a high number of trees (between 60 and 140) were planted between the years 1992 and 2013. Then the number of trees being planted dropped drastically. It would be interesting to see how this relates to the political party in power or the funding given to the parks board… but that is not something I am exploring in this analysis.
# Let's combine the two plots above.
# As in the course notes, we can use a selection interval on the line chart to select the year range that we are interested in looking at on the bar graph that identifies the different species.
select_year = alt.selection_interval()
interval_chart = trees_by_year_plot.properties(height=50).add_params(select_year)
bar_chart = total_tree_bar_plot_int.encode(x=alt.X('year_planted', title=None, scale=alt.Scale(domain=select_year))).properties(title="", height=200)
year_chart = bar_chart & interval_chart
# Add a title with instructions for how to use the interactivity.
year_chart_title = alt.TitleParams("Total trees planted per year",
subtitle = "Click on the species name (one or multiple) to select species. Use the lower chart to select the year range to zoom in on.",
anchor = 'middle',
fontSize = 14,
subtitleFontSize = 12)
year_chart = year_chart.properties(title=year_chart_title)
year_chart
This nice visualization allows us to zoom in on the particular range of years that we are interested in, and then explore which species (one or several) of trees were planted in those years.
Question 3. Can we visualize the total tree counts per neighbourhood on a map?#
# Following the instructions provided in the course notes, I will create a map of Vancouver.
url_geojson = 'https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/local-area-boundary.geojson'
data_geojson_remote = alt.Data(url=url_geojson, format=alt.DataFormat(property='features',type='json'))
data_geojson_remote
Data({
format: DataFormat({
property: 'features',
type: 'json'
}),
url: 'https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/local-area-boundary.geojson'
})
# Here is the base Vancouver map.
vancouver_map = alt.Chart(data_geojson_remote).mark_geoshape(
color = 'gray', opacity= 0.5, stroke='white').encode().project(type='identity', reflectY=True)
vancouver_map
# Now let's create another dataframe that we can use to plot points (in the correct location, based on latitude and longitude) of the total tree counts.
trees_by_hood = trees_filtered_df.groupby(by="neighbourhood_name").size().reset_index().rename(columns = {0: "tree_count"})
trees_by_hood
trees_by_hood_lat_lon = trees_filtered_df.groupby(by="neighbourhood_name").median(numeric_only=True).reset_index().drop(columns=["diameter"])
trees_by_hood_lat_lon
map_trees_df = pd.merge(trees_by_hood, trees_by_hood_lat_lon, on="neighbourhood_name", how="inner")
map_trees_df
| | neighbourhood_name | tree_count | level_0 | index | latitude | longitude |
|---|---|---|---|---|---|---|
| 0 | Arbutus-Ridge | 32 | 1514.5 | 2050.0 | 49.251766 | -123.161059 |
| 1 | Downtown | 44 | 1861.0 | 2736.0 | 49.279161 | -123.120819 |
| 2 | Dunbar-Southlands | 47 | 1320.0 | 2482.0 | 49.244350 | -123.186220 |
| 3 | Fairview | 25 | 1500.0 | 1918.0 | 49.263053 | -123.129507 |
| 4 | Grandview-Woodland | 42 | 1550.5 | 2043.5 | 49.271694 | -123.064417 |
| 5 | Hastings-Sunrise | 82 | 1525.5 | 2555.0 | 49.275150 | -123.043930 |
| 6 | Kensington-Cedar Cottage | 94 | 1613.0 | 2210.5 | 49.242945 | -123.074047 |
| 7 | Kerrisdale | 47 | 1471.0 | 2295.0 | 49.229408 | -123.154256 |
| 8 | Killarney | 44 | 1683.5 | 2376.5 | 49.220517 | -123.035917 |
| 9 | Kitsilano | 38 | 1405.0 | 2731.5 | 49.262380 | -123.153851 |
| 10 | Marpole | 50 | 1777.5 | 2896.5 | 49.212110 | -123.129391 |
| 11 | Mount Pleasant | 35 | 1909.0 | 1998.0 | 49.262858 | -123.099438 |
| 12 | Oakridge | 45 | 1451.0 | 2799.0 | 49.227775 | -123.123889 |
| 13 | Renfrew-Collingwood | 91 | 1510.0 | 2268.0 | 49.245406 | -123.040583 |
| 14 | Riley Park | 62 | 1701.0 | 2644.5 | 49.245382 | -123.100527 |
| 15 | Shaughnessy | 42 | 877.0 | 2667.5 | 49.243628 | -123.139753 |
| 16 | South Cambie | 29 | 1887.0 | 2427.0 | 49.246578 | -123.119656 |
| 17 | Strathcona | 12 | 1576.5 | 1852.5 | 49.282518 | -123.091634 |
| 18 | Sunset | 70 | 1577.0 | 2752.5 | 49.221410 | -123.093632 |
| 19 | Victoria-Fraserview | 73 | 1760.0 | 2373.0 | 49.220275 | -123.064658 |
| 20 | West End | 25 | 1821.0 | 2168.0 | 49.285731 | -123.131542 |
| 21 | West Point Grey | 24 | 849.0 | 2126.0 | 49.264220 | -123.208290 |
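The table above still carries the medians of level_0 and index, which have no meaning for the map. As a possible alternative to the two group-bys plus a merge, the sketch below builds the count and the median coordinates in a single named aggregation; the coordinates and the `hood_summary` name are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Stand-in tree locations; coordinates are illustrative.
demo_df = pd.DataFrame({
    "neighbourhood_name": ["Kitsilano", "Kitsilano", "Kitsilano", "Sunset"],
    "latitude": [49.26, 49.27, 49.25, 49.22],
    "longitude": [-123.16, -123.15, -123.17, -123.09],
})

# One pass: tree count plus median latitude and longitude per neighbourhood.
hood_summary = (
    demo_df
    .groupby("neighbourhood_name")
    .agg(
        tree_count=("latitude", "size"),
        latitude=("latitude", "median"),
        longitude=("longitude", "median"),
    )
    .reset_index()
)
print(hood_summary)
```

The result has only the columns the map needs, so no columns have to be dropped afterwards.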
# We can use the above dataframe as the basis of our 'points' visualization. Let's make the points white, with a black stroke.
points = alt.Chart(map_trees_df).mark_circle(stroke="black").encode(
longitude='longitude',
latitude='latitude',
size=alt.Size('tree_count:Q', title="Tree count"),
color=alt.Color(value='white'),
tooltip=[alt.Tooltip('neighbourhood_name:N', title='Neighbourhood'), alt.Tooltip('tree_count:Q', title='Total number of trees')]).project(type= 'identity', reflectY=True)
points
# To achieve the interactivity I would like in my final dashboard, I will create another layer to my map. The neighbourhoods, once clicked on in my heat map chart, will be highlighted in this map layer.
# I will make this layer green, to demonstrate that Vancouver is a "green" city of trees.
van_map = alt.Chart(data_geojson_remote).mark_geoshape().transform_lookup(
lookup='properties.name',
from_=alt.LookupData(map_trees_df, 'neighbourhood_name', ['tree_count', 'neighbourhood_name'])).encode(
opacity = alt.condition(select_neighbourhood_click, alt.value(1), alt.value(0.2)),
color = alt.Color(value="#005C29"),
tooltip=[alt.Tooltip('neighbourhood_name:N', title='Neighbourhood'), alt.Tooltip('tree_count:Q', title='Total number of trees')]).project(type='identity', reflectY=True).transform_filter(select_neighbourhood_click).add_params(select_neighbourhood_click)
# Combining all of the maps together creates an object that I can use in my dashboard.
points_map = vancouver_map + van_map + points
points_map
Interactive Dashboard#
# Now finally, let's combine all of our plots.
# Let's make sure to transform the tree plot according to the legend_select, and add both the select_neighbourhood_click and legend_select selections.
tree_plot_filtered = tree_plot_filtered.transform_filter(legend_select).add_params(select_neighbourhood_click, legend_select)
# I will add a title to indicate that the data demonstrate that Vancouver has a large distribution of tree species within all neighbourhoods.
overall_title = alt.TitleParams(
"Vancouver is a city of trees!",
subtitle = "Top 10 tree species well represented within all neighbourhoods",
anchor = 'middle',
fontSize = 20,
subtitleFontSize = 16)
(tree_plot_filtered | year_chart & points_map).properties(title=overall_title)
This dashboard visualization lets a user explore three variables together: species name, neighbourhood name, and year planted. By clicking between the heat map and the bar and line plots, the number of trees of each species, per neighbourhood and year, can be visualized. The map at the bottom doesn’t allow a user to click on it and interact with it, but rather just displays where within Vancouver each neighbourhood can be found. The points on the map also nicely summarize the total tree count (over all the years and species) in each neighbourhood.
Bonus - some extra additions… widgets!#
I made the conscious choice of not using widgets on my dashboard because I liked the elegant interactivity of clicking and selecting between the above plots. After a considerable amount of time playing around with different widget options, I decided that widgets don’t really add to the above visualization, and rather clutter and complicate it.
Nevertheless, to demonstrate my ability to add widgets to charts, I have added slider and dropdown widgets to a scatter plot below.
# Here I explored how tree height range and diameter are influenced by species and neighbourhood.
# I used a slider to choose the height_range_id to highlight in the scatter plot by changing its size. I added dropdowns to enable species and neighbourhood selection.
# On this chart, I adjusted the scale and size of the plot to zoom in on the data. There was one outlier point, with very large diameter, which I decided to "clip" to enable better visualization of the other points.
scatter_plot = alt.Chart(trees_filtered_df).mark_point(clip=True).encode(
x=alt.X('height_range_id', scale=alt.Scale(domain=[0,10]), axis=alt.Axis(tickCount=9), title="Height range id (scale of 1 to 9)"),
y=alt.Y('diameter', scale=alt.Scale(domain=[0,40]), title = "Diameter (in)")).properties(title="Tree diameter vs. height range")
slider_height = alt.binding_range(name='Height range ', min=1, max=9, step=1)
select_height = alt.selection_point(
fields=['height_range_id'],
bind=slider_height,
)
neighbourhoods = sorted(trees_filtered_df['neighbourhood_name'].unique())
dropdown_neighbourhoods = alt.binding_select(name='Neighbourhood ', options=neighbourhoods)
select_neighbourhood = alt.selection_point(fields=['neighbourhood_name'], bind=dropdown_neighbourhoods)
species = sorted(top10_trees)
dropdown_species = alt.binding_select(name='Species ', options=species)
select_species = alt.selection_point(fields=['species_name'], bind=dropdown_species)
scatter_plot = scatter_plot.add_params(select_neighbourhood, select_species, select_height).encode(
opacity=alt.condition(select_neighbourhood, alt.value(1), alt.value(0.05)),
size = alt.condition(alt.datum.height_range_id < select_height.height_range_id, alt.value(100), alt.value(10)),
color=alt.condition(select_species, alt.value('purple'), alt.value('gray'))).properties(height=500, width=800)
scatter_plot
The above interactive plot demonstrates that too many interactive options on a plot make things confusing and don’t add information. Also, the points somewhat obstruct each other. As mentioned before, I felt that my interactive dashboard was already complete without widgets, so I made the conscious choice of not adding them there.
Discussion and Concluding Remarks#
This final assignment demonstrated the amazing interactivity possible with Altair.
Given that the initial dataset we were given to work with only contained a subset of all of the trees, it is difficult to say whether or not the conclusions reached below are correct, but the following are a few observations/conclusions I made when interacting with the data:
Just a note to remind the reader that my dataframe was filtered to contain only the trees that contained a “date planted” value, and only for the top 10 species. So the total dataframe of 5000 trees was cut down to one of only 1053.
The top 10 species are quite well represented across all neighbourhoods, with most neighbourhoods containing at least 7 of the 10 species.
Only Strathcona falls below this cut-off, with only 6 of the top 10 species represented. However, when we look at the total tree count within Strathcona (via the tooltip on the Vancouver map), we see that this neighbourhood also has only 12 trees in total (within this filtered dataset). Kensington-Cedar Cottage had the largest number of trees, 94, followed by Renfrew-Collingwood with 91. It would be interesting to create a map indicating average trees per unit area across the city. This would be a better way of comparing the neighbourhoods, as certain neighbourhoods are larger than others, so “total trees” is not directly comparable between a large neighbourhood and a small one. Regardless, when looking at species distribution, most species are very well represented throughout the city.
There was quite a good split of different species being planted each year, with almost all species having several trees planted across the city each year.
Initially, when I created my EDA, I looked at ALL of the different tree species within the dataset, and created an area chart to compare which trees were being planted each year. This was WAY too much information. I decided, in this assignment, to narrow down to the top 10 species. This is, however, still a lot of different data to look at. I think the interactive bar chart would be particularly helpful if a user was interested in comparing 2 or 3 of the different species, and their planting trends over a period of time.
When I started this analysis, I wanted to compare the prevalence of deciduous and evergreen trees. I quickly realized, however, that the majority of the top 25-30 most common species in the dataset are deciduous. This was quite interesting and surprising. It seems as though the City of Vancouver prefers planting deciduous trees, as opposed to the evergreen trees that are native to this area (cedar, Douglas fir, spruce, etc.). Perhaps these trees are already so common in the city that the choice is made not to plant them? It would be interesting to look into this further.
2016 was a terrible year for planting trees.
As mentioned previously, it would be interesting to see what happened politically in Vancouver in this year, or whether parks board funding was cut for some reason, to explain the poor planting year in 2016.
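The observation about how many of the top 10 species each neighbourhood contains was read off the heat map, but it can also be computed. A minimal sketch, using a small stand-in frame with illustrative rows (the `demo_df` and `species_per_hood` names are assumptions introduced here):

```python
import pandas as pd

# Stand-in rows; the species assignments are illustrative.
demo_df = pd.DataFrame({
    "neighbourhood_name": [
        "Strathcona", "Strathcona", "Strathcona",
        "Sunset", "Sunset", "Sunset",
    ],
    "species_name": [
        "SERRULATA", "SERRULATA", "RUBRUM",
        "RUBRUM", "CAMPESTRE", "SERRULATA",
    ],
})

# Number of distinct species present in each neighbourhood, smallest first.
species_per_hood = (
    demo_df.groupby("neighbourhood_name")["species_name"]
    .nunique()
    .sort_values()
)
print(species_per_hood)
```

Applied to `trees_filtered_df`, the first entries of the sorted result would be the neighbourhoods with the fewest of the top 10 species.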
I used a combination of plots to answer my questions, including a:
heat map
bar chart
line chart
geographic map
scatter plot
References#
Vancouver trees dataset: https://opendata.vancouver.ca/explore/dataset/street-trees/information/?
Data Visualization sample final project for inspiration and coding help
Data Visualization course notes for coding examples and syntax